Add support for ORT Whisper #420
Conversation
The documentation is not available anymore as the PR was closed or merged.
lewtun
left a comment
Thanks for working on this great feature @mht-sharma 🚀 !!!
The API looks very good to me and most of my comments are nits. Did you manage to resolve the issue with the generations not matching the transformers model?
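(For readers following along: a minimal sketch of how such a parity check between the transformers and ORT generations could look. The checkpoint name, the dummy dataset, and the from_transformers flag are illustrative assumptions, not taken from this PR.)

```python
import torch
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "openai/whisper-tiny.en"  # hypothetical checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
pt_model = WhisperForConditionalGeneration.from_pretrained(model_id)
# from_transformers=True exports the checkpoint to ONNX on the fly (assumed API here)
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, from_transformers=True)

# Small dummy speech dataset commonly used in HF tests
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

pt_ids = pt_model.generate(inputs["input_features"])
ort_ids = ort_model.generate(inputs["input_features"])
assert torch.equal(pt_ids, ort_ids), "ORT generations diverge from the transformers model"
print(processor.batch_decode(pt_ids, skip_special_tokens=True))
```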
| SPEECH_SEQ2SEQ_ONNX_MODEL_DOCSTRING = r""" | ||
| Arguments: | ||
| input_features (`torch.FloatTensor`): | ||
| Float values mel features extracted from the raw speech waveform. |
| Float values mel features extracted from the raw speech waveform. | |
| Mel features extracted from the raw speech waveform. |
| """ | ||
|
|
||
| def __init__(self, session: onnxruntime.InferenceSession, device: torch.device): | ||
| def __init__(self, session: onnxruntime.InferenceSession, device: torch.device, main_input_name: str): |
Is this a breaking change to the existing API (I'm not 100% sure)?
If yes, one option would be to use:
| def __init__(self, session: onnxruntime.InferenceSession, device: torch.device, main_input_name: str): | |
| def __init__(self, session: onnxruntime.InferenceSession, device: torch.device, main_input_name: str = "input_ids"): |
This should not break anything, but I added a default value for the input name.
QQ - do we run the CI on a GPU machine? The following test case tests the pipeline on GPU (L1148). However, as per my understanding the pipeline is not running on GPU and it should fail, or am I missing something?
Hey @mht-sharma, the CI for GPU is scheduled to run nightly.
| return ORTEncoderForSpeechSeq2Seq(session=encoder_session, device=device, main_input_name=self.main_input_name) | ||
|
|
||
| def get_encoder_onnx_config(encoder_config: PretrainedConfig) -> OnnxConfig: | ||
| return SpeechSeq2SeqEncoderOnnxConfig(encoder_config, task="default") |
Currently, the class ORTModelForSpeechSeq2Seq returns generic encoder / decoder ONNX configs. The same design is used for the ORTModelForSeq2SeqLM class.
AutoModelForSpeechSeq2Seq contains multiple speech model classes within it. There may be some models that are not supported by the generic ONNX configs (need to confirm). Similar to the AutoClasses, we may need a structure to switch to the appropriate OnnxConfig based on the model type. Probably something we can work on after the refactor.
Since the whole ONNX part is also changing, I think it's fine to allow time for a refactor to support audio if needed
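(For illustration: a minimal sketch of the model-type dispatch idea discussed above. WhisperEncoderOnnxConfig is a hypothetical class name; the fallback to the generic config mirrors the current behaviour shown in the diff.)

```python
# Hypothetical model_type -> encoder OnnxConfig mapping, mirroring the AutoClass pattern.
_ENCODER_ONNX_CONFIGS = {
    "whisper": WhisperEncoderOnnxConfig,  # hypothetical per-model config class
}

def get_encoder_onnx_config(encoder_config: PretrainedConfig) -> OnnxConfig:
    # Fall back to the generic speech config when no model-specific one is registered.
    config_cls = _ENCODER_ONNX_CONFIGS.get(encoder_config.model_type, SpeechSeq2SeqEncoderOnnxConfig)
    return config_cls(encoder_config, task="default")
```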
lewtun
left a comment
Thanks a lot for making this beast of a model work with optimum @mht-sharma !
Out of curiosity, is there any significant slowdown when running the ORT model on CPU/GPU?
If not, I think this PR is in great shape - the main question is whether we should merge + refactor for the new ONNX exporter in #403 or wait until that PR is merged (at the expense of delaying this feature)?
WDYT @michaelbenayoun @echarlaix ?
@lewtun I think it makes more sense to merge the |
Force-pushed from 2d0a7ca to 325668f
The PR is now updated with the following changes.
Gently pinging team members for review: @lewtun @michaelbenayoun
| # bind logits | ||
| output_shape, output_buffer = self.prepare_output_buffer( | ||
| batch_size=input_features.size(0), | ||
| sequence_length=input_features.size(2) // 2, |
This way of getting the output shape seems too specific to the Whisper model and may not fit other SpeechSeq2Seq models. Is there a way to avoid giving output shapes? @JingyaHuang
Or, if this is the only way, should we rename the class to ORTModelForWhisperConditionalGeneration? This may lead to having different classes for each model type in the future. Probably something to discuss.
There is a way to avoid giving the output shape -> bind the output with an OrtValue, which will be the case for custom tasks #447, but the flaw is that you then need to transfer ownership across frameworks, which is something that we try to avoid. IMO, if you can infer the output shape, you shall bind it directly with a torch tensor. cc. @philschmid
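(A minimal sketch of the two binding options being contrasted here, using onnxruntime's IOBinding API; the model path, tensor names, and shapes are placeholders, not taken from the PR.)

```python
import numpy as np
import onnxruntime
import torch

# Placeholder model path and tensor names; adapt to the actual exported encoder.
session = onnxruntime.InferenceSession("encoder_model.onnx", providers=["CPUExecutionProvider"])
io_binding = session.io_binding()

features = np.zeros((1, 80, 3000), dtype=np.float32)  # dummy mel features
io_binding.bind_cpu_input("input_features", features)

use_preallocated_buffer = True
if use_preallocated_buffer:
    # Bind a pre-allocated torch buffer: requires knowing the output shape up front
    # (e.g. sequence_length = input_features.shape[2] // 2 for the Whisper encoder).
    buffer = torch.empty((1, 1500, 384), dtype=torch.float32)  # placeholder hidden size
    io_binding.bind_output(
        "last_hidden_state",
        device_type="cpu",
        device_id=0,
        element_type=np.float32,
        shape=tuple(buffer.shape),
        buffer_ptr=buffer.data_ptr(),
    )
else:
    # Let ORT allocate the output: no shape needed, but the result comes back as an
    # OrtValue whose ownership then has to be handed over to the consuming framework.
    io_binding.bind_output("last_hidden_state", device_type="cpu")

session.run_with_iobinding(io_binding)
outputs = buffer if use_preallocated_buffer else io_binding.get_outputs()[0].numpy()
```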
This should then be handled case by case in ORTModelForSpeechSeq2Seq, which can return the appropriate ORTModelEncoder based on the model_type.
Currently AutoModelForSpeechSeq2Seq contains three model types: whisper, speech_to_text and speech_encoder_decoder, so a simple if/else can be a cleaner approach. WDYT @JingyaHuang @lewtun
class ORTModelForSpeechSeq2Seq:
    ...
    ...
    def _initialize_encoder(
        self,
        session: onnxruntime.InferenceSession,
        config: transformers.PretrainedConfig,
        device: torch.device,
        use_io_binding: bool = True,
    ) -> ORTEncoderForSpeechSeq2Seq:
        if config.model_type == "whisper":
            return ORTEncoderForWhisper(...)
        else:
            return ...
Second this. We could have a general model type -> ORTEncoder class mapping, and ORTModelForSpeechSeq2Seq would use this:
class ORTModelForSpeechSeq2Seq:
    ...
    ...
    def _initialize_encoder(
        self,
        session: onnxruntime.InferenceSession,
        config: transformers.PretrainedConfig,
        device: torch.device,
        use_io_binding: bool = True,
    ) -> ORTEncoderForSpeechSeq2Seq:
        return _MODEL_TYPE_TO_ORTENCODER.get(model_type, "default")(...)
| "seq2seq-lm": OrderedDict({"logits": {0: "batch_size", 1: "decoder_sequence_length"}}), | ||
| "sequence-classification": OrderedDict({"logits": {0: "batch_size"}}), | ||
| "token-classification": OrderedDict({"logits": {0: "batch_size", 1: "sequence_length"}}), | ||
| "speech2seq-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}), |
Is it the "official" name?
We could take:
* automatic-speech-recognition to match the pipelines
* speech2text
@lewtun wdyt?
I think the idea was to partially align with the underlying autoclass, but I agree automatic-speech-recognition would be more intuitive.
In general (not for this PR), I think we should take the opportunity to align more closely with the Hub tasks, e.g. seq2seq-lm could also be text2text-generation right?
Alright then I guess we can keep speech2seq-lm for now since the other names are aligned to the AutoClass, and maybe change that (if needed) for all the tasks in another PR.
Force-pushed from 3d97bba to af7ddd3
| "seq2seq-lm": OrderedDict({"logits": {0: "batch_size", 1: "decoder_sequence_length"}}), | ||
| "sequence-classification": OrderedDict({"logits": {0: "batch_size"}}), | ||
| "token-classification": OrderedDict({"logits": {0: "batch_size", 1: "sequence_length"}}), | ||
| "speech2seq-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}), |
There was a problem hiding this comment.
Alright then I guess we can keep speech2seq-lm for now since the other names are aligned to the AutoClass, and maybe change that (if needed) for all the tasks in another PR.
| return False | ||
|
|
||
| @property | ||
| def torch_to_onnx_input_map(self) -> Mapping[str, str]: |
I would make it clear that it is needed when the dummy input names and the exported input names do not match.
Updated the docstring
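(To illustrate the point with a hedged example: the property lets a config remap dummy-input names to the names actually used in the exported graph. The subclass name and the concrete mapping below are hypothetical.)

```python
from typing import Mapping

class MyCustomOnnxConfig(OnnxConfig):  # hypothetical subclass for illustration
    @property
    def torch_to_onnx_input_map(self) -> Mapping[str, str]:
        # Only needed when the dummy input names used during export do not match
        # the input names of the exported ONNX graph; this mapping is illustrative.
        return {"decoder_input_ids": "input_ids", "decoder_attention_mask": "attention_mask"}
```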
optimum/exporters/onnx/base.py
Outdated
| # TODO: figure out a smart way of re-ordering potential nested structures. | ||
| # to_insert = sorted(to_insert, key=lambda t: t[0]) | ||
| for name, dynamic_axes in to_insert: | ||
| name = torch_to_onnx_input_map[name] if name in torch_to_onnx_input_map else name |
| name = torch_to_onnx_input_map[name] if name in torch_to_onnx_input_map else name | |
| name = self.torch_to_onnx_input_map.get(name, name) |
* added whisper to exporters
* Removed redundant code
* Added io binding for ORTModelForSpeechSeq2Seq
Force-pushed from 63a255f to 14358a0
Please change this line to (add .from_pretrained)
Hello, I have seen this discussion, which shows how we can use the pipeline for audio longer than 30s and how to change the task to transcribe; the chunking works for ONNX too, but I couldn't change the task to transcribe in the pipeline configuration for ONNX. Thanks!
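(Not a confirmed answer, but for reference: with the transformers Whisper tooling the transcribe task is usually selected through forced decoder ids, roughly as sketched below. Whether this carries over unchanged to the ORT pipeline is exactly the open question above; the checkpoint name, the from_transformers flag, and the pipeline wiring are assumptions.)

```python
from transformers import WhisperProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "openai/whisper-small"  # hypothetical checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, from_transformers=True)

# Force the transcribe task (rather than translate) through the decoder prompt ids.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="en", task="transcribe"
)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # chunking for audio longer than 30s
)
print(asr("long_audio.wav")["text"])
```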
What does this PR do?
This PR enables the export of the Whisper model to ONNX.
To enable this new modality, I refactored the existing ORTModelForConditionalGeneration to add support for multimodal models. The PR has a dependency on transformers PR 19525, which integrates the ONNX config for the Whisper model and adds support for the audio preprocessor.
Usage
Using Transformers AutoModelForSpeechSeq2Seq
Using Optimum ORTModelForSpeechSeq2Seq (see the usage sketch below)
Before submitting
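(A hedged sketch of the before/after usage referenced in the Usage section above; the checkpoint name, the from_transformers flag, and the commented decoding steps are illustrative assumptions.)

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "openai/whisper-tiny.en"  # hypothetical checkpoint
processor = AutoProcessor.from_pretrained(model_id)

# Using Transformers AutoModelForSpeechSeq2Seq
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Using Optimum ORTModelForSpeechSeq2Seq (exports the model to ONNX on the fly)
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, from_transformers=True)

# Both models expose the same generate() API on mel input_features, e.g.:
# inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
# tokens = ort_model.generate(inputs["input_features"])
# text = processor.batch_decode(tokens, skip_special_tokens=True)
```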